Home Credit Default Risk (HCDR)

The course project is based on the Home Credit Default Risk (HCDR) Kaggle competition. The goal of this project is to predict whether a client will repay a loan. To make sure that people who struggle to get loans due to insufficient or non-existent credit histories have a positive loan experience, Home Credit makes use of a variety of alternative data, including telco and transactional information, to predict its clients' repayment abilities.

Some of the challenges:

  1. Dataset size
    • 688 MB compressed, with millions of rows of data
    • 2.71 GB uncompressed

Kaggle API setup

Kaggle is a data science competition platform that hosts many datasets. In the past, it was troublesome to submit your results, as you had to go through the console in your browser and drag your files there. Now you can interact with Kaggle via the command line. E.g.,

! kaggle competitions files home-credit-default-risk

It is quite easy to set up; finishing a submission takes less than 15 minutes.

  1. Install the `kaggle` library

For more detailed information on setting up the Kaggle API, see here and here.

Dataset and how to download

Background: Home Credit Group

Many people struggle to get loans due to insufficient or non-existent credit histories. And, unfortunately, this population is often taken advantage of by untrustworthy lenders.

Home Credit Group

Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.

While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.

Background on the dataset

Home Credit is a non-banking financial institution, founded in 1997 in the Czech Republic.

The company operates in 14 countries (including the United States, Russia, Kazakhstan, Belarus, China, and India) and focuses on lending primarily to people with little or no credit history, who would otherwise either not obtain loans or become victims of untrustworthy lenders.

Home Credit Group has over 29 million customers, total assets of 21 billion euros, and over 160 million loans, with the majority in Asia and almost half of them in China (as of 19-05-2018).


Data files overview

There are 7 different sources of data:

(figure: overview of the seven HCDR data files)

Downloading the files via Kaggle API

Create a base directory:

DATA_DIR = "../../../Data/home-credit-default-risk"   #same level as course repo in the data directory

Please download the project data files and data dictionary and unzip them using either of the following approaches:

  1. Click on the Download button on the following Data Webpage and unzip the zip file to DATA_DIR
  2. If you plan to use the Kaggle API, please use the following steps.

Imports

Data files overview

Data Dictionary

As part of the data download comes a data dictionary. It is named HomeCredit_columns_description.csv.


Application train

Application test

The application dataset has the most information about the client: Gender, income, family status, education ...

The Other datasets

Exploratory Data Analysis

Summary of Application train

Summary Statistics

Commentary

The descriptive statistics show that DAYS_BIRTH, DAYS_EMPLOYED, DAYS_REGISTRATION, and DAYS_ID_PUBLISH take negative values, which is unexpected at first glance; these columns are recorded as days relative to the application date, so negative values indicate days in the past.

Missing data for application train

Distribution of the target column

Explore the distribution of values taken on by the target variable.

Number of days employed is an important feature that can be used for predicting risk. However, the histogram shows values that are not plausible: a large spike of extremely long employment durations, suggesting a sentinel/placeholder value rather than real data.

The histogram also shows a number of applications in which the applicant's car is reported to be over 60 years old.

Correlation with the target column

The distributions of the top correlated features are plotted below.

Density plots of correlated features are plotted below.

Applicants Age

Applicants occupations

Distribution of credit amounts

Visualize income vs loan amount, identified by default

Dataset questions

Unique record for each SK_ID_CURR

Previous applications for the submission file

The persons in the Kaggle submission file have had previous applications in previous_application.csv: 47,800 out of 48,744 people have had previous applications.

Histogram of Number of previous applications for an ID

Can we differentiate applicants by low, medium, and high numbers of previous applications?
* Low = fewer than 10 previous applications (22%)
* Medium = 10 to 39 (58%)
* High = 40 or more (20%)

Joining secondary tables with the primary table

In the case of the HCDR competition (and many other machine learning problems that involve multiple tables, in 3NF or not), we need to join (denormalize) these datasets when using a machine learning pipeline. Joining the secondary tables with the primary table will yield many new features about each loan application; these features will tend to be aggregate-type features or metadata about the loan or its application. How can we do this when using machine learning pipelines?

Joining previous_application with application_x

We refer to the application_train data (and application_test data) as the primary table and the other files as the secondary tables (e.g., the previous_application dataset). The secondary tables can be joined to the primary table using the key SK_ID_CURR (some of them indirectly, via SK_ID_PREV or SK_ID_BUREAU).

Let's assume we wish to generate a feature based on previous application attempts. Possible features here could be aggregates such as the number of previous applications or the mean amount requested across them.

To build such features, we need to join the application_train data (and the application_test data) with the previous_application dataset (and the other available datasets).

When joining this data in the context of pipelines, different strategies come to mind with various tradeoffs:

  1. Preprocess each of the non-application datasets, thereby generating many new (derived) features, and then join (aka merge) the results with the application_train data (the labeled dataset) and with the application_test data (the unlabeled submission dataset) prior to processing the data (in a train, validation, test partition) via your machine learning pipeline. [This approach is recommended for this HCDR competition. WHY?]

I want you to think about this section and build on this.

Roadmap for secondary table processing

  1. Transform all the secondary tables into features that can be joined into the main application table (labeled and unlabeled)
    • 'bureau', 'bureau_balance', 'credit_card_balance', 'installments_payments',
    • 'previous_application', 'POS_CASH_balance'
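A minimal sketch of this roadmap step with toy data: aggregate a secondary table down to one row per SK_ID_CURR, then left-join it into the application table. The column names follow the HCDR files, but the values here are made up.

```python
import pandas as pd

# Toy stand-in for previous_application (values are illustrative)
prev = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],
    "AMT_APPLICATION": [1000.0, 3000.0, 500.0],
})

# One row per current application, with min/max/mean of the secondary column
prev_agg = prev.groupby("SK_ID_CURR")["AMT_APPLICATION"].agg(["min", "max", "mean"])
prev_agg.columns = [f"PREV_AMT_APPLICATION_{c.upper()}" for c in prev_agg.columns]
prev_agg = prev_agg.reset_index()

# Toy stand-in for application_train
apps = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "TARGET": [0, 1, 0]})

# Left join keeps applications that have no previous-application rows (id 3)
merged = apps.merge(prev_agg, on="SK_ID_CURR", how="left")
print(merged.shape)  # (3, 5)
```

The same pattern repeats for each secondary table in the list above, with the join key swapped as needed (SK_ID_BUREAU for bureau_balance, SK_ID_PREV for the monthly-balance tables).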

agg detour

Aggregate using one or more operations over the specified axis.

For more details see agg

DataFrame.agg(func, axis=0, *args, **kwargs)

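A quick illustration of the agg signature above on a toy dataframe:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

# One function applied to every column
col_means = df.agg("mean")                    # A -> 2.0, B -> 20.0

# Several functions at once (result has one row per function)
ranges = df.agg(["min", "max"])

# Different functions per column
mixed = df.agg({"A": "sum", "B": "mean"})     # A -> 6, B -> 20.0
```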

Missing values in prevApps

Feature engineering for the prevApp table

Feature transformer for the prevApp table

Feature Engineering on Primary & Secondary Datasets

Merge each secondary dataset with the primary dataset's (application_train) target variable to understand the correlation between the target variable and the secondary dataset's features.

The following secondary datasets will be explored for correlation against the target variable.

Create Feature Aggregators

Engineer New Features

Engineer new features capturing relationship between income and credit amount as well as annuity and income for Application dataset

Engineer new features capturing range of annuity, application, and downpayment amounts from the Previous Application dataset

Build Pipeline for each Dataset

Engineer New features for Application Train Dataset

Prepare Datasets

Create Aggregate datasets after performing fit & transform

Join the labeled dataset

Perform data merging of primary application and secondary datasets.

Check presence of newly engineered features

Join the unlabeled dataset (i.e., the submission file)

Perform data merging of primary application and secondary datasets.

Processing pipeline

OHE when there are previously unseen unique values in the test/validation set

Train, validation, and test sets (and the leakage problem we have mentioned previously):

Let's look at a small use case to tell us how to deal with this:

This last problem can be solved by using the option handle_unknown='ignore' of the OneHotEncoder, which, as the name suggests, will ignore previously unseen values when transforming the test set.

Here is an example of that in action:

# Identify the categorical features we wish to consider.
cat_attribs = ['CODE_GENDER', 'FLAG_OWN_REALTY','FLAG_OWN_CAR','NAME_CONTRACT_TYPE', 
               'NAME_EDUCATION_TYPE','OCCUPATION_TYPE','NAME_INCOME_TYPE']

# Notice handle_unknown="ignore" in OHE, which ignores values in the validation/test set
# that do NOT occur in the training set
cat_pipeline = Pipeline([
        ('selector', DataFrameSelector(cat_attribs)),
        ('imputer', SimpleImputer(strategy='most_frequent')),
        # Note: in scikit-learn >= 1.2 the sparse= argument is named sparse_output=
        ('ohe', OneHotEncoder(sparse=False, handle_unknown="ignore"))
    ])

Please see this blog for more details on OHE when the validation/test set has previously unseen unique values.
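A minimal, self-contained demonstration of handle_unknown='ignore' on a toy column (the unseen category 'XNA' is made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train = np.array([["M"], ["F"]])
test = np.array([["M"], ["XNA"]])  # "XNA" never appears during fit

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train)
out = enc.transform(test).toarray()

# The unseen value encodes to an all-zeros row instead of raising an error
print(out)  # [[0. 1.]  (categories_ are sorted: ['F', 'M'])
            #  [0. 0.]]
```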

HCDR preprocessing

Dataframe Column Selector

Numerical Pipeline Set-up

Categorical Attributes

Create Consolidated Data Pipeline

Use ColumnTransformer instead of FeatureUnion

Summarize Features Considered and Lengths

Feature Engineering

Baseline Model

To get a baseline, we will use some of the features after they have been preprocessed through the pipeline. The baseline model is a logistic regression model.

Split Application into Train-Test split

Define Pipeline

Perform cross-validation and train the model

Split the training data into 10 folds to perform cross-validation.
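A sketch of 10-fold cross-validation with scikit-learn; the synthetic dataset here stands in for the preprocessed HCDR features:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the preprocessed application data
X, y = make_classification(n_samples=200, random_state=42)

# Stratified folds preserve the class balance in each fold, which matters
# for an imbalanced target like HCDR's
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring="roc_auc")
print(scores.mean())
```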

Evaluation metrics

Submissions are evaluated on area under the ROC curve between the predicted probability and the observed target.

The scikit-learn roc_auc_score function computes the area under the receiver operating characteristic (ROC) curve, which is also denoted AUC or AUROC. By computing the area under the ROC curve, the curve's information is summarized in one number.

>>> import numpy as np
>>> from sklearn.metrics import roc_auc_score
>>> y_true = np.array([0, 0, 1, 1])
>>> y_scores = np.array([0.1, 0.4, 0.35, 0.8])
>>> roc_auc_score(y_true, y_scores)
0.75

Confusion Matrix

Tune Baseline Model Parameters with GridSearch

The baseline Logistic Regression model was tuned across different parameters evaluated for the following metrics:

Boxplot with CV results

Final Results

Kaggle submission via the command line API

Submission File Prep

For each SK_ID_CURR in the test set, you must predict a probability for the TARGET variable. The file should contain a header and have the following format:

SK_ID_CURR,TARGET
100001,0.1
100005,0.9
100013,0.2
etc.
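A minimal sketch of building this file with pandas; the ids and probabilities below are the sample values from the format above, whereas in practice the TARGET column comes from the fitted pipeline's predict_proba on the test set:

```python
import pandas as pd

# Sample values from the required format; real probabilities come from the model
submission = pd.DataFrame({
    "SK_ID_CURR": [100001, 100005, 100013],
    "TARGET": [0.1, 0.9, 0.2],
})
submission.to_csv("submission.csv", index=False)  # header row included by default
```

The file can then be submitted with the command-line API, e.g. `kaggle competitions submit -c home-credit-default-risk -f submission.csv -m "baseline"`.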

Report submission

Click on this link

Write-up

Abstract

The primary purpose of the HCDR project is to create a machine learning model that can accurately predict whether a customer will repay a loan.

In the first phase of this project, we conducted basic exploratory data analysis on all the datasets, created a baseline pipeline, and selected key metrics. We then conducted a statistical analysis of the numerical and categorical features. By doing feature engineering for the highly correlated features, we were able to evaluate a better baseline.

The results we obtained in this phase indicate that there is no statistically significant difference between our baseline and best-performing model: the p-value of 0.051 falls just above the 0.05 significance threshold. Both models have a 91.9% accuracy score and roughly 75% AUC across the training, validation, and test datasets.

Our ROC_AUC score for the Kaggle submission was 0.74306.

Project Description

Home Credit is an international non-bank financial institution that aims to lend people money regardless of their credit history. Home Credit Group focuses on providing a positive borrowing experience for customers who do not rely on traditional banks. Thus, Home Credit Group published a dataset on Kaggle with the goal of identifying and solving unfair loan rejection.

The purpose of this project is to create a machine learning model that can accurately predict whether a customer will repay a loan. Our task is to build a pipeline for a baseline machine learning model using a logistic regression classifier. The final model will be evaluated using a number of different performance metrics that we can use to create a better model. Businesses can use this model to identify whether a loan is at risk of default. The new model will ensure that clients who are capable of repaying their loans are not rejected and that loans are given with a principal, maturity, and repayment calendar that will allow those clients to be successful.

The results of the machine learning pipelines are measured by using these metrics: Mean Absolute Error (MAE), Root Mean Square Error (RMSE), Accuracy Score, Precision, Recall, Confusion Matrix, and Area Under ROC Curve (AUC).

The results of our pipelines will be analyzed and ranked. The most efficient pipeline will be submitted to the Kaggle competition for the Home Credit Default Risk (HCDR).

Workflow

We are implementing the workflow outlined below. In Phase 0, we established the project's modelling requirements and outlined our plans. In Phase 1, we are performing the first of three planned iterations of the remainder of the workflow.

(figure: project workflow diagram)

Data Description

The dataset contains 1 primary table and 6 secondary tables.

Primary Tables

  1. application_train: This primary table includes the application information for each loan application at Home Credit in one row, including the target variable of whether or not the loan was repaid. We use this field as the basis to determine feature importance. The target variable is binary, since this is a classification problem. It takes on two different values:

    • '1' - client with payment difficulties: he/she had a late payment of more than N days on at least one of the first M installments of the loan in our sample
    • '0' - all other cases

    There are 122 variables and 307,511 data entries.
  2. application_test: This table includes the application information for each loan application at Home Credit in one row. The features are the same as the train data but exclude the target variable. There are 121 variables and 48,744 data entries.
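Given the binary target described above, a quick way to inspect its class balance is a normalized value count. This sketch uses toy values standing in for the real application_train column (the real data is heavily imbalanced):

```python
import pandas as pd

# Toy stand-in for application_train["TARGET"]; proportions are illustrative
target = pd.Series([0] * 92 + [1] * 8, name="TARGET")
proportions = target.value_counts(normalize=True)
print(proportions)  # 0 -> 0.92, 1 -> 0.08
```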

Secondary Tables

  1. Bureau: This table includes all previous credits received by a customer from other financial institutions prior to their loan application. There is one row for each previous credit, meaning a many-to-one relationship with the primary table. We can join it with the primary table using the current application ID, SK_ID_CURR. There are 17 variables and 1,716,428 data entries.

  2. Bureau Balance: This table includes the monthly balances for previous credits at other financial institutions. There is one row for each monthly balance, meaning a many-to-one relationship with the Bureau table. We can join it with the Bureau table using the bureau ID, SK_ID_BUREAU. There are 3 variables and 27,299,925 data entries.

  3. Previous Application: This table includes previous applications for loans made by the customer at Home Credit. There is one row for each previous application, meaning a many-to-one relationship with the primary table. We can join it with the primary table using the current application ID, SK_ID_CURR. There are four types of contracts: a. Consumer loan (POS – credit limit given to buy consumer goods), b. Cash loan (client is given cash), c. Revolving loan (credit), d. XNA (contract type without values). There are 37 variables and 1,670,214 data entries.

  4. POS CASH Balance: This table includes a monthly balance snapshot of a previous point-of-sale or cash loan that the customer has at Home Credit. There is one row for each monthly balance, meaning a many-to-one relationship with the Previous Application table. We would join it with the Previous Application table using the previous application ID, SK_ID_PREV, then join it with the primary table using the current application ID, SK_ID_CURR. There are 8 variables and 10,001,358 data entries.

  5. Credit Card Balance: This table includes a monthly balance snapshot of previous credit cards the customer has with Home Credit. There is one row for each previous monthly balance, meaning a many-to-one relationship with the Previous Application table. We can join it with the Previous Application table using the previous application ID, SK_ID_PREV, then join it with the primary table using the current application ID, SK_ID_CURR. There are 23 variables and 3,840,312 data entries.

  6. Installments Payments: This table includes previous repayments made, or not made, by the customer on credits issued by Home Credit. There is one row for each payment or missed payment, meaning a many-to-one relationship with the Previous Application table. We would join it with the Previous Application table using the previous application ID, SK_ID_PREV, then join it with the primary table using the current application ID, SK_ID_CURR. There are 8 variables and 13,605,401 data entries.

Data Tasks

The following data preprocessing tasks need to be achieved to prepare the datasets after downloading and unzipping the main application and secondary datasets:

  1. Analyze missing values from application_train table and feature correlations with target variable.
  2. Examine correlations between primary dataset's target variable and features from each secondary dataset.
  3. Create pipelines for primary and secondary datasets that generate minimum, maximum, and mean metrics using aggregate functions.
  4. Transform for primary and secondary datasets using the pipelines.
  5. Perform feature engineering to build new features for the previous_application dataset.
  6. Join the primary application dataset (labeled train and unlabeled test) with secondary tables on SK_ID_CURR. A left join is used so that any loan application record IDs that are missing secondary data are not dropped and will instead be imputed (strategy discussed in pipeline).
  7. Engineer new features around claim duration attributes and Occupation Type.
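A toy sketch of the left join in step 6: the indicator flag makes it easy to see which application rows found no secondary data and will therefore need imputation (table and column names here are illustrative):

```python
import pandas as pd

# Illustrative stand-ins for the application table and an aggregated secondary table
apps = pd.DataFrame({"SK_ID_CURR": [1, 2, 3]})
bureau_agg = pd.DataFrame({"SK_ID_CURR": [1, 2], "BUREAU_CNT": [4, 1]})

merged = apps.merge(bureau_agg, on="SK_ID_CURR", how="left", indicator=True)

# A left join drops no application rows; 'left_only' rows lacked secondary data
print(len(merged) == len(apps))            # True
print((merged["_merge"] == "left_only").sum())  # 1
```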

EDA

For the Exploratory Data Analysis component of this phase, we did a preliminary analysis of the data to ensure that our results would be accurate.

We looked at summary statistics for each table in the model. We primarily focused on the data distribution, identifying statistics such as the count, mean, standard deviation, minimum, IQR, and maximum.

We also looked at specific numerical and categorical features and visualized them. We created a heatmap to identify the correlation between each feature and the target variable. We also visualized the age, occupation, and distribution of credit amounts.

Please see the Exploratory Data Analysis section for our complete EDA.

Feature Engineering and transformers

In our feature engineering process, we created two types of features to enhance our dataset. First, we created aggregate features capturing the minimum, maximum, and mean of numerical attributes, across the primary and secondary datasets, that were highly correlated with the target variable.


We decided to engineer the following new features from the Application and Previous Application datasets:

In order to identify the highly correlated features, we created a simple function that took a secondary dataframe name as an input variable and generated a correlation matrix between all the features in the inputted dataframe and the primary dataset's target variable.

All the aggregate values were calculated from the original dataframes, and a new set of dataframes (comprising the primary and secondary datasets) was generated. After the secondary datasets were merged with the primary application_train dataset, the new consolidated application training dataframe had a total of 187 features (including the aggregate calculations for specific features).

Further, the most highly correlated features (positive and negative) were chosen from both the primary and secondary datasets. These features were then classified into numerical and categorical variables to form inputs for 2 individual pipelines. In total, our baseline model comprised 53 features (46 numerical and 7 categorical).

(Please see Feature Engineering section and Feature Aggregator for more details)

Pipelines

Implementing logistic regression as a baseline model is a good starting point for classification tasks due to its easy implementation and low computational requirements. For the first experiment, we combined our data preparation pipeline and logistic regression with default parameters (penalty='l2', C=1.0, solver='lbfgs', tol=1e-4). We wanted to fine-tune the regularization (l1 vs l2), tolerance, and C hyperparameters using grid search and compare the resulting best estimator with the baseline model. We used 5-fold cross-validation along with the hyperparameters to tune the model with the GridSearchCV function in scikit-learn.

Here is the high-level workflow for the model pipeline followed by detailed steps:

  1. Download data and perform data pre-processing tasks (joining primary and secondary datasets, transformation).
  2. Create a data pipeline with highly correlated numerical and categorical features.
  3. Impute missing numerical attributes with mean values and categorical values with the most frequent values.
  4. Apply ColumnTransformer to combine both numerical and categorical features.
  5. Create a model with the data pipeline and baseline model, and fit it to the training dataset.
  6. Evaluate the model using accuracy score, AUC score, RMSE, and MAE for the train, validation, and test datasets. Record the results in a dataframe.
  7. Perform grid search to tune the logistic regression model with regularization ('l1', 'l2'), tolerance (0.0001, 0.00001, 0.0000001), and C (10, 1, 0.1, 0.01) hyperparameters and 5-fold cross-validation.
  8. Record the grid search results to the dataframe (expLog) and find the best estimator based on accuracy scores.
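The grid-search step can be sketched as follows. The synthetic dataset stands in for the preprocessed HCDR features, and solver='liblinear' is used here because it supports both the l1 and l2 penalties in the grid:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the preprocessed application data
X, y = make_classification(n_samples=300, random_state=0)

param_grid = {
    "penalty": ["l1", "l2"],
    "C": [10, 1, 0.1, 0.01],
    "tol": [1e-4, 1e-5, 1e-7],
}
gs = GridSearchCV(LogisticRegression(solver="liblinear"),
                  param_grid, cv=5, scoring="accuracy")
gs.fit(X, y)
print(gs.best_params_)
```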

Two experiments were conducted in total to develop a baseline model.

Experimental results

Below is the resulting table for the two baseline models we developed on the given dataset.

Since HCDR is a classification task, we used the following metrics to measure model performance.

MAE

The mean absolute error is the average of the absolute values of individual prediction errors over all instances in the test set. Each prediction error is the difference between the true value and the predicted value for the instance.

$$ \text{MAE}(\mathbf{X}, h_{\mathbf{\theta}}) = \dfrac{1}{m} \sum\limits_{i=1}^{m}{| \mathbf{x}^{(i)}\cdot \mathbf{\theta} - y^{(i)}|} $$

RMSE

The root mean square error is the normalized distance between the vector of predicted values and the vector of observed values. First, the squared difference between each observed value and predicted value is calculated. RMSE is the square root of the mean of these squared differences.

$$ \text{RMSE}(\mathbf{X}, h_{\mathbf{\theta}}) = \sqrt{\dfrac{1}{m} \sum\limits_{i=1}^{m}{( \mathbf{x}^{(i)}\cdot \mathbf{\theta} - y^{(i)})^2}} $$
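Both formulas can be computed directly with scikit-learn; a small sketch with made-up values:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up labels and predicted probabilities
y_true = np.array([0, 1, 1, 0])
y_pred = np.array([0.1, 0.8, 0.6, 0.3])

mae = mean_absolute_error(y_true, y_pred)           # mean of |error| = 0.25
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # sqrt of mean squared error
print(mae)  # 0.25
```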

Accuracy Score

This metric describes the fraction of correctly classified samples. In scikit-learn, it can be modified to return solely the number of correct samples. Accuracy is the default scoring method for both logistic regression and k-nearest neighbors in scikit-learn.

Precision

The precision is the ratio of true positives over the total number of predicted positives.

Recall

The recall is the ratio of true positives over the sum of true positives and false negatives. Recall assesses the ability of the classifier to find all the positive samples. The best value is 1 and the worst value is 0.

Confusion Matrix

The confusion matrix, in this case for a binary classification, is a 2x2 matrix that contains the count of the true positives, false positives, true negatives, and false negatives.
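A small sketch with made-up labels, also recovering the precision and recall defined above from the matrix entries:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up binary labels and predictions
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

# sklearn's 2x2 matrix is [[tn, fp], [fn, tp]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 1 1 1 2

precision = precision_score(y_true, y_pred)  # tp / (tp + fp) = 2/3
recall = recall_score(y_true, y_pred)        # tp / (tp + fn) = 2/3
```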

AUC (Area under ROC curve)

An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters:

  • True Positive Rate
  • False Positive Rate

AUC stands for "Area under the ROC Curve." That is, AUC measures the entire two-dimensional area underneath the entire ROC curve from (0,0) to (1,1).

AUC is desirable for the following two reasons:

  1. AUC is scale-invariant. It measures how well predictions are ranked, rather than their absolute values.
  2. AUC is classification-threshold-invariant. It measures the quality of the model's predictions irrespective of what classification threshold is chosen.
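The scale-invariance point is easy to demonstrate: rescaling the scores changes their absolute values but not their ranking, so the AUC is unchanged.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# An affine rescaling preserves the rank order of the scores,
# so the ROC curve, and hence the AUC, is identical
auc1 = roc_auc_score(y_true, scores)
auc2 = roc_auc_score(y_true, scores * 100 + 7)
print(auc1, auc2)  # 0.75 0.75
```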

p-value

The p-value is the probability of obtaining test results at least as extreme as the results actually observed, under the assumption that the null hypothesis is correct. A very small p-value means that such an extreme observed outcome would be very unlikely under the null hypothesis.

We will compare the classifiers with the untuned baseline model by conducting a two-tailed hypothesis test.

Null hypothesis, H0: There is no significant difference between the two machine learning pipelines.

Alternate hypothesis, HA: The two machine learning pipelines are different.

A p-value less than or equal to the significance level is considered statistically significant.
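A sketch of such a test on hypothetical per-fold accuracy scores (the values below are made up; scipy's ttest_rel implements the two-tailed paired t-test, which is appropriate when both pipelines are evaluated on the same CV folds):

```python
import numpy as np
from scipy import stats

# Hypothetical per-fold accuracy scores for two pipelines on the same folds
baseline = np.array([0.918, 0.920, 0.919, 0.917, 0.921])
tuned    = np.array([0.920, 0.921, 0.921, 0.917, 0.923])

# Two-tailed paired t-test: H0 = no difference between the pipelines
t_stat, p_value = stats.ttest_rel(baseline, tuned)
reject_h0 = p_value <= 0.05  # reject H0 only at or below the significance level
```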

Discussion

In our first experiment, our baseline model achieved fairly high accuracy scores of around 91.9% on the training and validation datasets. Our test accuracy score was similarly placed at 91.8%. While this suggests a robust model, the AUC values for the training, test, and validation datasets are only between 74% and 75%, which means there is a significant probability of false positives or false negatives. Our second experiment, in which we applied grid search to identify the best hyperparameters for logistic regression, exhibited similar accuracy and AUC metrics. In both experiments, the RMSE and MAE values were fairly low. In addition, the tuned algorithm from grid search did not achieve statistical significance, but came very close, with a p-value of 0.051. Therefore, we fail to reject our null hypothesis: the data do not provide evidence of a significant difference between the two pipelines.

We believe that a high accuracy score combined with somewhat lower AUC might be partly the result of an imbalanced dataset with respect to the target variable. From our EDA, we found that nearly 92% of our application training data had a target value of 0 (no default), with the remaining 8% having a value of 1. With such a small share of defaulted loans, a model can achieve a high accuracy score simply by favoring the majority class, which is why AUC is the more informative metric here.

For our Kaggle submission, we used the baseline model with best parameters (second experiment) since the test accuracy was slightly better.

We opted out of using Mean Absolute Percentage Error (MAPE) as an evaluation metric (as originally proposed in Phase 0) because we were getting an undefined output ('Inf') due to division by zero in the denominator.

Conclusion

In the Home Credit Default Risk (HCDR) project, we are using Home Credit’s data to better predict repayment of a loan by a customer with little to no credit history.

In Phase 1 of our project, we designed and developed a process to ingest Home Credit's application and secondary client data, analyze the dataset features, transform the data and engineer new features, and evaluate machine learning algorithms. Our workflow led us to a tuned algorithm that was not statistically significantly different from untuned logistic regression, but came close, with a p-value of 0.051.

We developed and evaluated a baseline model using tuned logistic regression. Though our model exhibited high accuracy scores, there is still some room for improvement with respect to the AUC scores. In Phase 2, we plan to refine our model training process by adding additional features from the bureau datasets. While evaluating other algorithms listed in our project proposal, we also plan to capture log loss as part of our model evaluation metrics in order to get a complete picture of model performance.

Challenges

One of the major challenges faced was identifying a method to select the most relevant features. It was initially difficult to achieve a balance between too few or too many features for our baseline model. Also, as mentioned in the Conclusion, our team maintains a healthy skepticism of the results as we believe we need to resample our dataset in the next phase. We also had to focus a lot of time on getting our code to work smoothly and ensuring the basic data transformations, feature engineering, and set-up is correct in addition to analyzing the results of our baseline model.

Along the way, we faced several technical issues in developing this notebook:

Kaggle Submission

Below is the screenshot of our best Kaggle submission.

References

Some of the material in this notebook has been adapted from here.

TODO: Predicting Loan Repayment with Automated Feature Engineering in Featuretools

Read the following: